Q1: Paper Review¶

Title: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Authors: Sergey Ioffe and Christian Szegedy

Main Idea and Motivation:

The paper discusses Batch Normalization, a mechanism designed to accelerate the training of deep networks. The authors identify internal covariate shift as a major hindrance to the training of machine learning systems and propose Batch Normalization as a solution.

Summary:

The authors present Batch Normalization as a technique for mitigating the issue of internal covariate shift in deep learning networks. This technique normalizes the inputs for each layer of the network, reducing the amount of shift. The authors also note that the normalization process introduces a level of noise into each layer's inputs. They argue that this noise has a regularizing effect and can make the network less likely to overfit the data.

Approach and Contributions:

The authors employed a combination of analytical and empirical methods. They introduced the concept of Batch Normalization, providing a detailed algorithm for its implementation. The analytical approach was supported by empirical analysis, where they conducted experiments to validate their theory.
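The normalization step at the heart of the algorithm is simple. Below is a minimal NumPy sketch of the per-feature transform described in the paper (training mode only); `gamma` and `beta` stand for the learned scale and shift parameters, and `eps` is a small constant for numerical stability:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Sketch of the Batch Normalization forward pass over a mini-batch.

    x: (batch_size, features) mini-batch activations
    gamma, beta: learned per-feature scale and shift parameters
    """
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to ~zero mean, unit variance
    return gamma * x_hat + beta            # scale and shift

# With gamma=1, beta=0, the output has roughly zero mean and unit variance
x = np.random.randn(64, 10) * 5 + 3
y = batch_norm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))
```

Note that at inference time the paper replaces the mini-batch statistics with population estimates; that part is omitted here.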

The primary approach used by the authors was to predict the digit class on the MNIST dataset using a simple network with a 28x28 binary image as input and 5 layers of 100 rectified linear hidden units. This was done to confirm the effect of internal covariate shift on training and the ability of Batch Normalization to counteract it.

The main findings are that Batch Normalization combats internal covariate shift, thereby improving and accelerating the training of deep networks. The results show that Batch Normalization makes training more resilient to the scale of the parameter initialization and much more capable of utilizing high learning rates. When training with Batch Normalization, a training example is seen in conjunction with the other examples in its mini-batch, so the network no longer produces deterministic values for a given training example. This effect was found to be beneficial for the generalization of the network. Compared to Dropout, which is typically used to reduce overfitting, Batch Normalization can either eliminate the need for it or allow its strength to be reduced.

These contributions are pivotal for machine learning and its applications as they offer a practical solution for the efficient training of deep learning models. By managing the issue of internal covariate shift, the authors have opened up possibilities for accelerating the training process and improving the performance of these models.

The paper builds upon the existing work on deep learning and the challenges of training such models. Specifically, it addresses the issue of internal covariate shift, which complicates the training process. The authors' work contributes to the ongoing research in this area by providing a novel mechanism for addressing this problem.

Areas for Improvement:

The evaluation relies on a limited set of datasets, which weakens the generality of the conclusions.

Explore Batch Normalization's effect on gradient propagation.

Provide a detailed explanation of the regularization mechanism.

Validate results on more complex datasets.

Compare Batch Normalization with other normalization techniques to understand its advantages and disadvantages.

Q2: Train two different models using sklearn, Logistic Regression and Support Vector Machine (SVC), to learn each of the following gates¶

Q2.1: Train Logistic Regression model using sklearn to learn each of the following gates: AND, OR, XOR, NAND.¶

In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Define the input data and labels for each gate
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

# AND Gate
y_and = np.array([0, 0, 0, 1])
model_and = LogisticRegression(solver='lbfgs', max_iter=10000, C=1000)  # Increase C to reduce regularization, making the model more flexible and allowing it to fit the training data more closely
model_and.fit(X, y_and)
predictions_and = model_and.predict(X)
print(f"AND Gate Predictions: {predictions_and}")

# Inspect the coefficients and intercept
print(f"AND Gate Coefficients: {model_and.coef_}, Intercept: {model_and.intercept_}")
AND Gate Predictions: [0 0 0 1]
AND Gate Coefficients: [[8.7581697 8.7581697]], Intercept: [-13.48605578]
In [2]:
# OR Gate
y_or = np.array([0, 1, 1, 1])
model_or = LogisticRegression(solver='lbfgs', max_iter=10000, C=1000)  # Same settings as the AND gate model
model_or.fit(X, y_or)
predictions_or = model_or.predict(X)
print(f"OR Gate Predictions: {predictions_or}")

# Inspect the coefficients and intercept
print(f"OR Gate Coefficients: {model_or.coef_}, Intercept: {model_or.intercept_}")
OR Gate Predictions: [0 1 1 1]
OR Gate Coefficients: [[8.75177947 8.75177947]], Intercept: [-4.02582316]
In [3]:
# XOR Gate
y_xor = np.array([0, 1, 1, 0])
model_xor = LogisticRegression(solver='lbfgs', max_iter=10000, C=1000)  # Same settings as the AND gate model
model_xor.fit(X, y_xor)
predictions_xor = model_xor.predict(X)
print(f"XOR Gate Predictions: {predictions_xor}")

# Inspect the coefficients and intercept
print(f"XOR Gate Coefficients: {model_xor.coef_}, Intercept: {model_xor.intercept_}")
XOR Gate Predictions: [0 0 0 0]
XOR Gate Coefficients: [[0. 0.]], Intercept: [0.]
In [4]:
# NAND Gate
y_nand = np.array([1, 1, 1, 0])
model_nand = LogisticRegression(solver='lbfgs', max_iter=10000, C=1000)  # Same settings as the AND gate model
model_nand.fit(X, y_nand)
predictions_nand = model_nand.predict(X)
print(f"NAND Gate Predictions: {predictions_nand}")

# Inspect the coefficients and intercept
print(f"NAND Gate Coefficients: {model_nand.coef_}, Intercept: {model_nand.intercept_}")
NAND Gate Predictions: [1 1 1 0]
NAND Gate Coefficients: [[-8.7581697 -8.7581697]], Intercept: [13.48605578]

AND Gate: Logistic Regression is able to learn this gate since it is linearly separable.¶

OR Gate: Logistic Regression is able to learn this gate since it is linearly separable.¶

XOR Gate: Logistic Regression is not able to learn this gate since it is not linearly separable. There is no straight line that can separate the classes correctly since the output classes are interleaved. A single-layer model like Logistic Regression (which is a linear classifier) cannot capture this pattern. It requires a model that can capture non-linear relationships, such as a multi-layer neural network.¶

NAND Gate: Logistic Regression is able to learn this gate since it is linearly separable.¶
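One way to make the XOR point concrete: if we add a hand-crafted product feature x1*x2 (a feature engineered purely for illustration, not part of the assignment), XOR becomes linearly separable in the augmented 3-D space, and plain Logistic Regression can then learn it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Augment the inputs with the product feature x1*x2;
# e.g. w = (1, 1, -2), b = -0.5 separates XOR in this 3-D space
X_aug = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

model = LogisticRegression(C=1000, max_iter=10000)
model.fit(X_aug, y_xor)
print(model.predict(X_aug))  # [0 1 1 0]
```

This is the same idea that a hidden layer automates: learning a non-linear transformation of the inputs under which the classes become separable.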

Q2.2: Train Support Vector Machine (SVC) model using sklearn to learn each of the following gates: AND, OR, XOR, NAND.¶

In [5]:
import numpy as np
from sklearn.svm import SVC

# Define the input data and labels for each gate
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

# AND Gate
y_and = np.array([0, 0, 0, 1])
model_and = SVC(kernel='rbf', gamma='scale', C=1000) # scale is the default value for gamma and is recommended for small datasets
model_and.fit(X, y_and)
predictions_and = model_and.predict(X)
print(f"AND Gate Predictions: {predictions_and}")

# OR Gate
y_or = np.array([0, 1, 1, 1])
model_or = SVC(kernel='rbf', gamma='scale', C=1000)
model_or.fit(X, y_or)
predictions_or = model_or.predict(X)
print(f"OR Gate Predictions: {predictions_or}")

# XOR Gate
y_xor = np.array([0, 1, 1, 0])
model_xor = SVC(kernel='rbf', gamma='scale', C=1000)
model_xor.fit(X, y_xor)
predictions_xor = model_xor.predict(X)
print(f"XOR Gate Predictions: {predictions_xor}")

# NAND Gate
y_nand = np.array([1, 1, 1, 0])
model_nand = SVC(kernel='rbf', gamma='scale', C=1000)
model_nand.fit(X, y_nand)
predictions_nand = model_nand.predict(X)
print(f"NAND Gate Predictions: {predictions_nand}")
AND Gate Predictions: [0 0 0 1]
OR Gate Predictions: [0 1 1 1]
XOR Gate Predictions: [0 1 1 0]
NAND Gate Predictions: [1 1 1 0]

I used SVC(kernel='rbf') to learn all four gates (AND, OR, XOR, NAND). The radial basis function (RBF) kernel, also known as the Gaussian kernel, can handle both linearly separable and non-linearly separable problems, so it can learn the patterns of all four logic gates.¶
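A small sketch contrasting the two kernels on XOR; the linear-kernel model here is added only for comparison and is not part of the solution above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# A linear kernel cannot separate XOR...
linear_svc = SVC(kernel='linear', C=1000).fit(X, y_xor)
# ...while the RBF kernel implicitly maps the points into a space where they are separable
rbf_svc = SVC(kernel='rbf', gamma='scale', C=1000).fit(X, y_xor)

print("linear:", linear_svc.predict(X), "accuracy:", accuracy_score(y_xor, linear_svc.predict(X)))
print("rbf:   ", rbf_svc.predict(X))  # [0 1 1 0]
```

The linear kernel necessarily misclassifies at least one of the four points, while the RBF kernel fits all four, matching the output shown above.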

Q3: Use randomly generated data with at least 1000 observations to demonstrate how the ROC curve differs from (Precision-Recall) PR curve for unbalanced datasets. Plot both curves on different levels of imbalance between the two classes (e.g. 50/50, 75/25, 90/10).¶

In [6]:
# https://en.wikipedia.org/wiki/Receiver_operating_characteristic

[Figure: example of a ROC curve]

Precision¶

The ratio of correctly predicted positive observations to all predicted positives is known as precision.

It gauges how well the model forecasts the positive outcomes.

Precision = TruePositives / (TruePositives + FalsePositives)

Precision is concerned with the quality of positive predictions. A high precision indicates that the model has a low rate of false positives.

Recall¶

The ratio of correctly predicted positive observations to the total number of actual positive observations is known as recall.

It gauges how well the model can capture each pertinent instance.

Recall = TruePositives / (TruePositives + FalseNegatives)

Recall is concerned with the quantity of the relevant instances captured by the model. A high recall indicates that the model has a low rate of false negatives.
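These definitions can be checked with sklearn on a small hypothetical set of predictions (the labels below are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, confusion_matrix

# Hypothetical labels for 10 observations (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# sklearn's confusion matrix ravels to (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}")
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print("recall:   ", recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
```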

In [7]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_recall_curve, auc
In [8]:
def plot_roc_pr_curves(X, y, title):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = LogisticRegression(solver='lbfgs', max_iter=1000)
    model.fit(X_train, y_train)
    y_scores = model.predict_proba(X_test)[:, 1]
    
    fpr, tpr, _ = roc_curve(y_test, y_scores)
    roc_auc = auc(fpr, tpr)
    
    precision, recall, _ = precision_recall_curve(y_test, y_scores)
    pr_auc = auc(recall, precision)
    
    plt.figure(figsize=(14, 6))
    
    # ROC Curve
    plt.subplot(1, 2, 1)
    plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve - {title}')
    plt.legend(loc='lower right')
    
    # PR Curve
    plt.subplot(1, 2, 2)
    plt.plot(recall, precision, label=f'PR Curve (AUC = {pr_auc:.2f})')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f'Precision-Recall Curve - {title}')
    plt.legend(loc='lower left')
    
    plt.show()

# Generate and plot for different levels of imbalance
for ratio in [0.5, 0.25, 0.1]:
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10, 
                               n_clusters_per_class=1, weights=[ratio], flip_y=0, random_state=42)
    plot_roc_pr_curves(X, y, f'Class Imbalance {int(ratio*100)}/{int((1-ratio)*100)}')
[Output: ROC and PR curve plots for each imbalance level]
In [9]:
# Balanced dataset example
X_balanced, y_balanced = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10, 
                                             n_clusters_per_class=1, weights=[0.5, 0.5], flip_y=0, random_state=42)
plot_roc_pr_curves(X_balanced, y_balanced, 'Class Imbalance 50/50')
[Output: ROC and PR curve plots for the 50/50 balanced dataset]

Q3.1: Which metric would you prefer for balanced vs imbalanced datasets and why?¶

Balanced Datasets:

ROC Curve is generally preferred because it provides a good measure of how well the model can distinguish between the two classes. Since the dataset is balanced, both precision and recall are equally important, and the ROC curve effectively captures this.

Imbalanced Datasets:

Precision-Recall (PR) Curve is preferred because it focuses on the performance with respect to the positive class (usually the minority class in imbalanced datasets). The PR curve is more informative in scenarios where the number of true negatives is high, and we want to evaluate how well the model identifies the positive class.
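As a numeric sketch of the same point (the dataset parameters below are illustrative, not taken from the plots above): with a 95/5 imbalance the ROC AUC can look strong, while the area under the PR curve, which is sensitive to false positives among the many negatives, is typically lower:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# 95/5 class imbalance; the minority class (1) is the positive class
X, y = make_classification(n_samples=2000, n_features=20, n_informative=2,
                           weights=[0.95], flip_y=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

roc = roc_auc_score(y_te, scores)
ap = average_precision_score(y_te, scores)  # summary of the PR curve
print(f"ROC AUC: {roc:.3f}, average precision: {ap:.3f}")
```

The gap between the two numbers widens as the positive class becomes rarer, which is why the PR curve is the more honest metric for imbalanced problems.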

Q3.2: Provide a real-world scenario where you would apply the above reasoning¶

Balanced Dataset:

In a large-scale screening study for hypertension, where the prevalence of hypertension is around 50%, the dataset is balanced. Both true positive (correctly identifying hypertensive patients) and true negative (correctly identifying non-hypertensive patients) are equally important. In this scenario, the ROC curve is appropriate because it provides a comprehensive evaluation of the model's performance across all classification thresholds, ensuring accurate diagnosis for both hypertensive and non-hypertensive cases.

Imbalanced Dataset:

A medical diagnostic test for HIV, where the number of healthy individuals far exceeds the number of HIV-positive patients. In this case, the PR curve would be more informative as it focuses on the performance with respect to the positive class (HIV detection), which is more critical in this scenario.

Q4: Build a small neural network using Tensorflow without using the Keras API¶

Generate a regression dataset of your choice with some added noise.¶

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# Generate a regression dataset
np.random.seed(42)
X = np.linspace(-1, 1, 100).reshape(-1, 1)
y = 3 * X + np.random.normal(0, 0.1, X.shape)

plt.scatter(X, y)
plt.title('Regression Dataset with Noise')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
[Output: scatter plot of the noisy regression dataset]
In [11]:
X.shape
Out[11]:
(100, 1)

Train the ANN using the TF GradientTape function for backpropagation.¶

The tf.GradientTape function is used within the training loop for backpropagation.¶

In [12]:
# Initialize weights and biases
W1 = tf.Variable(np.random.randn(1, 10), dtype=tf.float32) # a shape of (1, 10) filled with random numbers from a normal distribution, and it ensures that the data type is float32
b1 = tf.Variable(np.zeros(10), dtype=tf.float32)
W2 = tf.Variable(np.random.randn(10, 1), dtype=tf.float32)
b2 = tf.Variable(np.zeros(1), dtype=tf.float32)
In [13]:
W1
Out[13]:
<tf.Variable 'Variable:0' shape=(1, 10) dtype=float32, numpy=
array([[-1.4153707 , -0.42064533, -0.34271452, -0.80227727, -0.16128571,
         0.40405086,  1.8861859 ,  0.17457782,  0.2575504 , -0.07444592]],
      dtype=float32)>
In [14]:
b1
Out[14]:
<tf.Variable 'Variable:0' shape=(10,) dtype=float32, numpy=array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>
In [15]:
W2
Out[15]:
<tf.Variable 'Variable:0' shape=(10, 1) dtype=float32, numpy=
array([[-1.9187713 ],
       [-0.02651387],
       [ 0.06023021],
       [ 2.463242  ],
       [-0.19236097],
       [ 0.30154735],
       [-0.03471177],
       [-1.168678  ],
       [ 1.1428229 ],
       [ 0.75193304]], dtype=float32)>
In [16]:
b2
Out[16]:
<tf.Variable 'Variable:0' shape=(1,) dtype=float32, numpy=array([0.], dtype=float32)>
In [17]:
W1.shape
Out[17]:
TensorShape([1, 10])
In [18]:
# Define forward pass
def forward_pass(X):
    hidden = tf.nn.relu(tf.add(tf.matmul(X, W1), b1)) # Apply the ReLU activation in the hidden layer to introduce non-linearity
    output = tf.add(tf.matmul(hidden, W2), b2)
    return output

# Define the loss function (Mean Squared Error)
def compute_loss(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))

# Define training parameters
learning_rate = 0.01
epochs = 1000

# Convert the NumPy array X into a TensorFlow constant with data type float32. This conversion is necessary because TensorFlow operations work with TensorFlow data structures.
X_tf = tf.constant(X, dtype=tf.float32) 
y_tf = tf.constant(y, dtype=tf.float32)

# A list that will store the loss values computed at each epoch during the training process. Tracking the loss history helps us monitor the model's performance and convergence over time
loss_history = []
In [19]:
X_tf.shape
Out[19]:
TensorShape([100, 1])
In [20]:
y_tf.shape
Out[20]:
TensorShape([100, 1])
In [21]:
# Training loop
for epoch in range(epochs):
    with tf.GradientTape() as tape: # The tf.GradientTape context is used to record operations for automatic differentiation.
        y_pred = forward_pass(X_tf)
        loss = compute_loss(y_tf, y_pred)
    # The tape.gradient method computes the gradients of the loss with respect to the trainable variables (W1, b1, W2, b2).
    # These gradients indicate how much each variable should be adjusted to reduce the loss.
    gradients = tape.gradient(loss, [W1, b1, W2, b2])
    # The assign_sub method updates each variable by subtracting the product of the learning rate and the corresponding gradient.
    # This step is the core of the gradient descent optimization process, where weights and biases are adjusted to minimize the loss.
    W1.assign_sub(learning_rate * gradients[0])
    b1.assign_sub(learning_rate * gradients[1])
    W2.assign_sub(learning_rate * gradients[2])
    b2.assign_sub(learning_rate * gradients[3])
    # The current loss value is appended to the loss_history list, which stores the loss values for each epoch.
    loss_history.append(loss.numpy())

# Plot the results
plt.plot(X, y, 'b.', label='Data')
plt.plot(X, y_pred.numpy(), 'r-', label='Prediction')
plt.title('Regression Results')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

# Plot the loss curve
plt.plot(loss_history)
plt.title('Loss Curve')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
[Output: regression fit and loss curve plots]

The initial model above was a simple neural network with one hidden layer.

The model was able to fit the data with some noise reasonably well, as seen in the regression results and the loss curve.

Increase the Level of Noise and Train the ANN¶

In [22]:
# Increase noise in the dataset
y_noisy = 3 * X + np.random.normal(0, 0.5, X.shape)  # Increased noise

plt.scatter(X, y_noisy)
plt.title('Regression Dataset with Increased Noise')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
[Output: scatter plot of the dataset with increased noise]
In [23]:
# Initialize new weights and biases for more complex model
W1 = tf.Variable(np.random.randn(1, 10), dtype=tf.float32)
b1 = tf.Variable(np.zeros(10), dtype=tf.float32)
W2 = tf.Variable(np.random.randn(10, 10), dtype=tf.float32)
b2 = tf.Variable(np.zeros(10), dtype=tf.float32)
W3 = tf.Variable(np.random.randn(10, 1), dtype=tf.float32)
b3 = tf.Variable(np.zeros(1), dtype=tf.float32)
In [24]:
X.shape
Out[24]:
(100, 1)
In [25]:
y_noisy.shape
Out[25]:
(100, 1)
In [26]:
W1
Out[26]:
<tf.Variable 'Variable:0' shape=(1, 10) dtype=float32, numpy=
array([[ 2.3146586 , -1.8672652 ,  0.68626016, -1.6127158 , -0.47193187,
         1.0889506 ,  0.06428002, -1.0777447 , -0.7153037 ,  0.67959774]],
      dtype=float32)>
In [27]:
W1.shape
Out[27]:
TensorShape([1, 10])
In [28]:
b1.shape
Out[28]:
TensorShape([10])
In [29]:
W2.shape
Out[29]:
TensorShape([10, 10])
In [30]:
b2.shape
Out[30]:
TensorShape([10])
In [31]:
W3.shape
Out[31]:
TensorShape([10, 1])
In [32]:
b3.shape
Out[32]:
TensorShape([1])

X.shape is (100, 1) and W1.shape is TensorShape([1, 10]); why does Hidden Layer 1 have 10 neurons?

In a neural network, the number of neurons in a hidden layer is determined by the shape of the weight matrix connecting the input layer (or the previous hidden layer) to that hidden layer.

In this case, X.shape is (100, 1) which means I have 100 samples (or data points) each with 1 feature. When I multiply X with W1, the shape of W1 is (1, 10), which means I am mapping the 1 input feature to 10 hidden units. Each column in W1 represents the weights from the input feature to a particular hidden unit. Therefore, I have 10 hidden units in the first hidden layer.

Here's how the dimensions work out:

X has shape (100, 1)

W1 has shape (1, 10)

When I multiply X with W1, I get a hidden layer activation of shape (100, 10) because each of the 100 samples is mapped to the 10 hidden units. In the activation matrix of shape (100, 10), each row represents one of the 100 samples, and each column represents one of the 10 neurons in the hidden layer. Each element in the matrix represents the activation value of a particular neuron for a specific sample.

Therefore, based on the shape of W1, the first hidden layer has 10 neurons.
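The shape arithmetic can be verified directly in NumPy; this is a standalone sketch mirroring the shapes discussed above:

```python
import numpy as np

X = np.random.randn(100, 1)    # 100 samples, 1 feature each
W1 = np.random.randn(1, 10)    # maps the 1 input feature to 10 hidden units
b1 = np.zeros(10)              # one bias per hidden unit

# (100, 1) @ (1, 10) -> (100, 10); broadcasting adds b1 to every row
hidden = np.maximum(0, X @ W1 + b1)  # ReLU activation
print(hidden.shape)  # (100, 10): one row per sample, one column per neuron
```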

In [33]:
# Define forward pass for the more complex model
def forward_pass_complex(X):
    hidden1 = tf.nn.relu(tf.add(tf.matmul(X, W1), b1))
    hidden2 = tf.nn.relu(tf.add(tf.matmul(hidden1, W2), b2))
    output = tf.add(tf.matmul(hidden2, W3), b3)
    return output

# Training loop for the complex model
y_noisy_tf = tf.constant(y_noisy, dtype=tf.float32)
loss_history_complex = []

for epoch in range(epochs):
    with tf.GradientTape() as tape:
        y_pred_complex = forward_pass_complex(X_tf)
        loss_complex = compute_loss(y_noisy_tf, y_pred_complex)
    gradients_complex = tape.gradient(loss_complex, [W1, b1, W2, b2, W3, b3])
    W1.assign_sub(learning_rate * gradients_complex[0])
    b1.assign_sub(learning_rate * gradients_complex[1])
    W2.assign_sub(learning_rate * gradients_complex[2])
    b2.assign_sub(learning_rate * gradients_complex[3])
    W3.assign_sub(learning_rate * gradients_complex[4])
    b3.assign_sub(learning_rate * gradients_complex[5])
    loss_history_complex.append(loss_complex.numpy())

# Plot the results for the complex model
plt.plot(X, y_noisy, 'b.', label='Data with Noise')
plt.plot(X, y_pred_complex.numpy(), 'r-', label='Prediction')
plt.title('Regression Results with Increased Noise')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

# Plot the loss curve for the complex model
plt.plot(loss_history_complex)
plt.title('Loss Curve for Complex Model')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
[Output: regression fit and loss curve plots for the complex model]

Summary

Adding an additional layer to the neural network increased its capacity to handle more complex patterns in the data.

The more complex model shown above was able to fit the noisier dataset better, as seen in the regression results and the loss curve.

Input (1 neuron) -> Hidden Layer 1 (10 neurons) -> Hidden Layer 2 (10 neurons) -> Output (1 neuron)

Q5: Design an optimal solution, i.e. the smallest possible neural network with the architecture and hyperparameters for the spiral problem.¶

Base model


Add L2 Regularization


Add more neurons in a layer without regularization


Add more neurons in a layer with regularization

I observed overfitting (the model performs well on training data but poorly on test data), so I introduced regularization. Regularization Rate: start with a small rate like 0.001 and increase if overfitting persists. Regularization helps to penalize larger weights and can lead to better generalization.


Decrease regularization Rate


Add more neurons in a layer - final model


Option 1: 2 Hidden Layers, Each with 8 Neurons

Advantages:

Simplicity: A straightforward architecture that is relatively easy to understand and implement.

Fewer Parameters: Compared to deeper networks, this configuration will have fewer parameters, which can reduce the risk of overfitting and make the model easier to train.

Potential for Faster Training: With fewer layers, the model may train faster due to the reduced complexity.

Drawbacks:

Capacity: This architecture might not capture very complex patterns as effectively as a deeper network with more layers. For some complex problems, it might underperform.

Option 2: 3 Hidden Layers, Each with 3 or 4 Neurons

Advantages:

Increased Depth: More layers can help the model learn more complex patterns and representations, which can be beneficial for capturing the intricacies of the data.

Regularization by Depth: A deeper network can act as a form of regularization, helping the model generalize better by breaking down the learning process into more steps.

Drawbacks:

Potential for Overfitting: If the dataset is small or not very complex, adding more layers can increase the risk of overfitting.

Longer Training Time: More layers mean more parameters to optimize, which can increase the training time and computational requirements.

Decision Factors:

Complexity of the Problem: If the problem is highly complex and non-linear, the additional depth of the second option might be necessary. However, if the problem is simpler, the first option might suffice.

Training Data: If you have a large amount of high-quality training data, a more complex model (option 2) might be beneficial. For smaller datasets, the simpler model (option 1) is less likely to overfit.

Computational Resources: If you have limited computational resources or need faster training times, option 1 might be more practical.

Performance Metrics: Evaluate both models on a validation dataset using metrics like accuracy, precision, recall, and F1 score to see which performs better in practice.

Recommendation:

I would put Option 1 (2 hidden layers, each with 8 neurons) into production in this case. The problem is non-linear, but the spiral pattern is not highly complex, and I have limited computational resources. This architecture also achieves the lower test loss.